This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Faster room joins: fix race in recalculation of current room state #13151

Merged

Conversation

Contributor

@squahtx squahtx commented Jul 1, 2022

When we finish un-partial stating all events in a room, we recalculate
the current state using forward extremities. This can race with
persistence of another event, which could result in an invalid current
room state in the database.

To avoid the race, we recalculate current room state in the same
queue as event persistence. The event persistence queue may be on
another worker, so a new replication endpoint is required as well.
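The idea of putting both operations in one queue can be sketched roughly as follows. This is a minimal illustration, not Synapse's implementation: `RoomTaskQueue`, `persist_event` and `recalculate_state` are hypothetical names, and the sketch ignores workers and replication entirely. It only shows that tasks submitted for the same room run one at a time, in submission order, so a state recalculation cannot interleave with an in-flight persist.

```python
import asyncio
from collections import defaultdict, deque
from typing import Awaitable, Callable, Deque, Dict, List, Tuple

Task = Callable[[], Awaitable[None]]


class RoomTaskQueue:
    """Hypothetical per-room queue: tasks for one room never overlap."""

    def __init__(self) -> None:
        self._queues: Dict[str, Deque[Tuple[Task, asyncio.Future]]] = defaultdict(deque)
        # One draining task per room that currently has queued work.
        self._drainers: Dict[str, asyncio.Task] = {}

    async def add(self, room_id: str, task: Task) -> None:
        """Queue `task` for `room_id` and wait until it has run."""
        done: asyncio.Future = asyncio.get_running_loop().create_future()
        self._queues[room_id].append((task, done))
        if room_id not in self._drainers:
            self._drainers[room_id] = asyncio.ensure_future(self._drain(room_id))
        await done

    async def _drain(self, room_id: str) -> None:
        """Run the room's queued tasks strictly one after another."""
        queue = self._queues[room_id]
        try:
            while queue:
                task, done = queue.popleft()
                try:
                    await task()
                except Exception as e:
                    if not done.done():
                        done.set_exception(e)
                else:
                    if not done.done():
                        done.set_result(None)
        finally:
            self._drainers.pop(room_id, None)


async def demo() -> List[str]:
    log: List[str] = []
    queue = RoomTaskQueue()

    async def persist_event() -> None:
        log.append("persist:start")
        await asyncio.sleep(0.01)  # stand-in for a database write
        log.append("persist:end")

    async def recalculate_state() -> None:
        log.append("recalc:start")
        await asyncio.sleep(0)
        log.append("recalc:end")

    room_id = "!room:example.org"
    await asyncio.gather(
        queue.add(room_id, persist_event),
        queue.add(room_id, recalculate_state),
    )
    return log


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Although both tasks are submitted concurrently, the recalculation only starts once the persist has finished, which is the ordering guarantee the description above relies on.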

Fixes #13007.


May be easiest to review commit by commit.

Sean Quah added 4 commits June 30, 2022 23:51
This moves us closer to fixing the race between recalculation of a
room's current state and event persistence. The next step is to move
recalculation of current state into the event persistence queue.

Signed-off-by: Sean Quah <seanq@matrix.org>

Avoid races between event persistence and recalculation of a room's
current state by putting them in the same queue.

Signed-off-by: Sean Quah <seanq@matrix.org>

Signed-off-by: Sean Quah <seanq@matrix.org>
@squahtx squahtx requested a review from a team as a code owner July 1, 2022 10:48
Member

@erikjohnston erikjohnston left a comment

The tests are failing, alas.

Comment on lines +56 to +57
async def _serialize_payload(room_id: str) -> JsonDict:  # type: ignore[override]
    return {}
Member

I think you can drop this since returning an empty dict is the default implementation in the base class.

Contributor Author

Python complains because the base implementation is an abstractmethod

Member

oh boo
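
The behaviour discussed in this thread can be reproduced in isolation. The sketch below is an assumption-laden illustration: `BaseEndpoint`, `WithoutOverride` and `WithOverride` are hypothetical classes, not Synapse's, and the use of `@staticmethod` is a guess based on the snippet having no `self` parameter. It shows that a method marked `@abstractmethod` must be overridden even when the base class gives it a working body.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Any, Dict


class BaseEndpoint(ABC):
    """Hypothetical base class with an abstract method that has a default body."""

    @staticmethod
    @abstractmethod
    async def _serialize_payload(**kwargs: Any) -> Dict[str, Any]:
        # A body exists, but the method is still abstract.
        return {}


class WithoutOverride(BaseEndpoint):
    """Relies on the base implementation, so it stays abstract."""


class WithOverride(BaseEndpoint):
    @staticmethod
    async def _serialize_payload(room_id: str) -> Dict[str, Any]:  # type: ignore[override]
        return {}


def can_instantiate(cls: type) -> bool:
    """Return True if `cls()` succeeds, False if Python raises TypeError."""
    try:
        cls()
    except TypeError:
        return False
    return True


if __name__ == "__main__":
    print(can_instantiate(WithoutOverride))
    print(can_instantiate(WithOverride))
    print(asyncio.run(WithOverride._serialize_payload("!room:example.org")))
```

Instantiating the subclass without the override fails with a `TypeError`, which is why the seemingly redundant `return {}` override has to stay.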

self._events_shard_config = hs.config.worker.events_shard_config
self._instance_name = hs.get_instance_name()

self._update_current_state = (
Member

Might be worth naming this _client to make it obvious what the difference is between it and update_current_state?

Contributor Author

Renamed!

end_item = queue[-1]
existing_task = queue[-1].task
# add our events to the existing queue item
existing_task.events_and_contexts.extend(task.events_and_contexts)
Member

I wonder if it'd be cleaner to have a def try_update(..) -> bool method on _EventPersistQueueTask that encapsulates this logic? _UpdateCurrentStateTask would simply always return False (i.e. update failed), and _PersistEventsTask would contain this check. Though that might complicate things unnecessarily.

Contributor Author

I had a go at this

Contributor Author

squahtx commented Jul 5, 2022

TestRestrictedRoomsLocalJoin and TestSendJoinPartialStateResponse are known worker mode flakes: #13161

@squahtx squahtx requested a review from erikjohnston July 5, 2022 15:03
@squahtx squahtx enabled auto-merge (squash) July 7, 2022 11:50
@squahtx squahtx merged commit 1391a76 into develop Jul 7, 2022
@squahtx squahtx deleted the squah/faster_room_joins_fix_current_state_recalculation_race branch July 7, 2022 12:19
Development

Successfully merging this pull request may close these issues.

Faster joins: fix race in calculating "current state"
2 participants